ABSTRACT

Learning a more distributed representation of the input fea- 
ture space is a powerful method to boost the performance 
of a given predictor. Often this is accomplished by parti- 
tioning the data into homogeneous groups by clustering so 
that separate models could be trained on each cluster. In- 
tuitively each such predictor is a better representative of 
the members of the given cluster than a predictor trained 
on the entire data-set. Previous work has used this basic 
premise to construct a simple yet strong bagging strategy. 
However, such models have one significant drawback: In- 
stances (such as students) are clustered while features (tu- 
tor usage features/items) are left alone. One-way cluster- 
ing by using some objective function measures the degree 
of homogeneity between data instances. Often it is noticed 
that features also influence final prediction in homogeneous 
groups. This indicates a duality in the relationship between 
clusters of instances and clusters of features. Co-Clustering 
simultaneously measures the degree of homogeneity in both 
data instances and features, thus also achieving clustering 
and dimensionality reduction simultaneously. Students and 
features could be modelled as a bipartite graph and a si- 
multaneous clustering could be posed as a bipartite graph 
partitioning problem. In this paper we integrate an effective 
bagging strategy with Co-Clustering and present results for 
prediction of out-of-tutor performance of students. We re- 
port that such a strategy is very useful and intuitive, even 
improving upon performance achieved by previous work. 

Keywords 

Out-of-Tutor Prediction, Dynamic Assessment, Spectral
Co-clustering, Ensemble Learning, Bootstrap-Aggregation

1. INTRODUCTION 

A significantly large student population would usually have 
a wide variation in learning rates and knowledge levels. While 
there are numerous reasons for this diversity, three major 
reasons are related to: the type of instruction or help they
respond best to, the way they are oriented towards learning
and their levels of intellectual development. Needless
to say, such differences would be reflected in the way
students interact with educational software, making educa- 
tional data quite difficult to mine well. Specifically there 
are many educational data mining problems where the end 
goal is to predict the performance of a student on a given 
in-tutor or out-of-tutor task. In-tutor tasks include
predicting the probability that a student will answer an item
correctly after attempting a sequence of similar questions,
whereas out-of-tutor tasks include predicting student
performance in post-tests based on the data from their tutor
usage.

The idea that students are quite different makes it appar- 
ent that perhaps it is not such a good idea to fit a global 
prediction model over the entire dataset for making predic- 
tions. In spite of the differences between students, educators 
commonly observe that students actually lie in very rough 
groups and have similar pedagogical needs. Taking a cue 
from this intuition, the task of prediction can be improved 
by clustering students into somewhat homogeneous groups 
and then training a separate predictor for each group. Such 
a predictor would obviously be a much better representative 
of students in that cluster as compared to a predictor which 
is fit on the entire dataset. For example, it makes sense
to have one model for students roughly classified as fast
learners and a different model for slow learners, rather than
the same model for both. This rather simple strategy of
grouping students together and then modeling them separately
can lead
to improved performance in prediction and perhaps even 
better interpretability.

While the above approach is compelling, there are two ma- 
jor issues with it. Firstly, while it is useful to model students 
as belonging to different groups, it is also known that such 
groupings are quite fuzzy and approximate. Students might 
actually possess different characteristics in varying degrees 
and what really sets them apart are certain dominant char- 
acteristics. For example students classified as fast learn- 
ers might actually be slow learners in certain skills. A fast 
learner might also belong to the group of students that are 
good at recalling information etc. Thus, such complex
characteristics cannot possibly be modelled by simply clustering
students to a certain limit and then training models for each
cluster. This “spread” of features in a student across groups 
also needs to be captured to make a distributed predictive 
model such as the above more meaningful. Such an issue 
can be resolved by varying the granularity of the clustering 
and training separate models each time so that such features 
can be accounted for. A simple yet quite effective strategy 
to do so was proposed by the authors and was seen to work 
quite well both in educational contexts (in-tutor predictions 
[3], out-of-tutor predictions [4], [5]) and more generally [6].
The second problem with the above approach is that clustering
is implicitly assumed to be one-way, i.e. only students are
clustered. But this need not be the case, and clustering only
students tells only half of the story. As
an example, consider a matrix in which the rows represent 
students and the columns represent their responses to cer- 
tain items. Clearly, clustering students would depend upon 
their item distributions, implicitly suggesting that for cer- 
tain students certain items are more important than others. 
Similarly if items were to be clustered, they would depend 
on which groups of students get them correct (or incorrect) 
most frequently. This indicates a duality between these two 
clusterings, which on simultaneous co-clustering could be 
very useful in answering many research questions. Co-clustering
of such a student versus item matrix would pair clusters
of student proficiency with clusters of item performance,
which could be seen as a sort of subject-treatment interaction.
This idea could be extended to the more general case
of students and features rather than just items. In this work
we use this idea of co-clustering students and their tutor in- 
teraction features and interleave it with the bagging strategy 
which was used with clustering. This combined
approach is then used to predict the post-test scores of
students.

This paper is organized as follows: In Section 2 we discuss
the idea of co-clustering in more detail and show that
co-clustering can be posed as a bipartite graph partitioning
problem. In Section 3 we describe a general framework in which
we interleave co-clustering with the idea of generating an
ensemble. In Section 4 we describe the experimental results
which demonstrate the validity of this approach. In Section
5 we discuss the results and also describe some avenues for
further work.


2. CO-CLUSTERING 

Clustering is a fundamental tool from unsupervised learn- 
ing for data analysis that groups together relatively homo- 
geneous objects. The central idea for clustering is that every 
object could be specified by a feature vector (or a point in 
the feature space) and then the degree of homogeneity be- 
tween them could be measured by some objective function 
that uses these feature vectors. For example in k-means 
clustering: the points are grouped so as to minimize a dis- 
tortion function, which is basically the sum of distances of 
all points from their assigned cluster centroids [7].
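As a small illustration of the distortion function, the sketch below scores a given assignment of points to centroids. It assumes squared Euclidean distance, which is the usual choice for k-means; the function name `distortion` and the toy data are hypothetical and only serve the example.

```python
def distortion(points, centroids, assignment):
    """Sum of squared Euclidean distances from each point to its assigned centroid.

    Assumption of this sketch: squared Euclidean distance is used.
    """
    total = 0.0
    for point, cluster in zip(points, assignment):
        centre = centroids[cluster]
        total += sum((p - c) ** 2 for p, c in zip(point, centre))
    return total


# Toy data: two tight groups, one around x = 0 and one around x = 10.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

good = distortion(points, centroids, [0, 0, 1, 1])  # each point with its own group
bad = distortion(points, centroids, [0, 1, 0, 1])   # groups mixed up
```

A homogeneous grouping gives a much smaller distortion than a mixed one, which is what k-means exploits when it searches over assignments.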
Clustering algorithms are one-way, i.e. one dimension of the 
data (say the rows of the data matrix) is clustered based 
on the similarities measured on the second dimension (say 
the columns). As pointed out in the previous section it 
might be desirable, quite frequently, to cluster along both 
the dimensions simultaneously, exploiting the apparent
duality between them. Such simultaneous clustering can
often offer interesting insights about the nature of interaction
between the clusters at both the dimensions [8]. This
utility is fast making co-clustering a fundamental tool for data
analysis, as is indicated by its widespread use in text and
document mining [9], [10]; bioinformatics and gene expression
analysis [11], [12]; collaborative filtering [13] and many
other practical applications.

While there are now a number of approaches to co-clustering,
such as those based on spectral graph theory [10] and
information theory [14], [15], each with its advantages, we
consider the approach proposed by Dhillon [10], which formulates
the problem of co-clustering as a bipartite graph partitioning
problem. We now briefly describe this approach starting
with the relevant notation and definitions.


2.1 Notation and Definitions 

A graph is represented as G = (V, E) where V represents the
set of vertices and E represents the set of all edge weights
$E_{ij}$, where $E_{ij}$ is the edge weight between vertices i and j.


Definition 1. The n x n Weighted Adjacency Matrix M
of an undirected graph is defined as the matrix with entries

$$m_{ij} = \begin{cases} E_{ij} & \text{if there is an edge } \{i, j\} \\ 0 & \text{otherwise} \end{cases}$$

If $m_{ij} = 0$ it implies that vertices i and j are not connected
by an edge. If $m_{ij} \neq 0$ it implies that the vertices i and j are
connected and $m_{ij}$ is the corresponding edge weight. Since
the graph is undirected, $m_{ij} = m_{ji}$ necessarily.


Definition 2. Given the weighted adjacency matrix of a
graph and a partition of the vertex set V into two disjoint
subsets $V_1$ and $V_2$, the cut between these two subsets is
defined as:

$$\mathrm{cut}(V_1, V_2) = \sum_{i \in V_1,\, j \in V_2} m_{ij}$$
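The cut of Definition 2 is straightforward to compute from a weighted adjacency matrix. The following is a minimal sketch; the function name `cut` and the four-vertex example graph are hypothetical.

```python
def cut(M, part_one, part_two):
    """Sum of the edge weights m_ij with i in part_one and j in part_two."""
    return sum(M[i][j] for i in part_one for j in part_two)


# Hypothetical graph: two strongly connected pairs {0, 1} and {2, 3},
# joined by a single light edge between vertices 1 and 2.
M = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]

natural_cut = cut(M, [0, 1], [2, 3])    # only the light bridge is severed
unnatural_cut = cut(M, [0, 2], [1, 3])  # both heavy edges are severed
```

The partition that separates the two strongly connected pairs has the smaller cut, which matches the intuition that a good clustering severs as little edge weight as possible.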


An undirected bipartite graph is a triple represented by
G = (S, F, E) where S and F are two sets of vertices and E
is the set of edges. Since it is a bipartite graph, every
edge in E has one endpoint in S and the other in F. In
our case the set S is the set of students while the set F is the
set of features. The set of features could readily be seen as
a set of item-responses as well. If F is the set of items, then
an edge between $s_i$ and $f_j$ exists if that item was answered
correctly by that student and not otherwise. More generally,
if F is just a set of features, then the edge $\{s_i, f_j\}$ simply
represents the value of that feature scaled between 0 and 1
for that student. Given this definition of a bipartite graph,
we now define its adjacency matrix.

Consider an m x n dimensional data matrix with students on
the rows and the items or features on the columns. Let’s
suppose this matrix is given by A. Clearly, the adjacency
matrix of the bipartite graph is given as:

$$M = \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}$$
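The block structure of M can be sketched in a few lines of Python. The helper name `bipartite_adjacency` and the small 2 x 3 example matrix are hypothetical; the data matrix is assumed to be given as a list of rows.

```python
def bipartite_adjacency(A):
    """Return M = [[0, A], [A^T, 0]] for an m x n data matrix A (list of rows)."""
    m, n = len(A), len(A[0])
    size = m + n
    M = [[0.0] * size for _ in range(size)]
    for i in range(m):
        for j in range(n):
            M[i][m + j] = A[i][j]  # top-right block: A
            M[m + j][i] = A[i][j]  # bottom-left block: A^T
    return M


# Two students (rows) and three features (columns), values already in [0, 1].
A = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
M = bipartite_adjacency(A)
```

The result is a symmetric (m + n) x (m + n) matrix whose first m indices are students and whose remaining n indices are features.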


The zeroes on the top-left and the bottom-right sub-matrices
signify the absence of connections amongst the elements of
S and F respectively (since connections in a bipartite graph
can only run between S and F). The matrix M is arranged
with A at the top right corner and $A^T$ at the bottom left,
so that the first m rows of M represent the set of students
and the next n rows represent the set of features or items.

Suppose the bipartite graph (whose adjacency matrix is
defined above) is partitioned into k clusters $V_1, \ldots, V_k$.
Given this partitioning, a corresponding set of student clusters
$S_1, \ldots, S_k$ and corresponding feature clusters $F_1, \ldots, F_k$
would also be obtained. It could be intuitively seen that the best
possible such clustering would be the one for which the sum of
the weights of all edges which cross between clusters is
the minimum possible. As defined by [10] this corresponds
to:

$$\mathrm{cut}(S_1 \cup F_1, \ldots, S_k \cup F_k) = \min_{V_1, \ldots, V_k} \mathrm{cut}(V_1, \ldots, V_k)$$

where $V_1, \ldots, V_k$ represents a k-partitioning of the graph.
The above definition leads us to the Bipartite Graph
Partitioning problem:

Definition 3. The bipartite graph partitioning problem:
Given a graph as defined earlier, find subsets $V_1^*$ and $V_2^*$
of V, almost of equal size, such that

$$\mathrm{cut}(V_1^*, V_2^*) = \min_{V_1, V_2} \mathrm{cut}(V_1, V_2)$$

The bipartite graph partitioning problem as defined above is
NP-Complete. However, a good relaxation to this problem
is given by spectral graph bi-partitioning. This relaxation
is achieved via the graph Laplacian. The Laplacian L of a
graph is a symmetric positive semi-definite matrix whose
un-normalized form is given by $L = D - M$, where D is
the degree matrix and M is the adjacency matrix as defined
earlier. Note that D is a diagonal matrix while M is
a symmetric matrix with all zeros on the diagonal. Thus,
the Laplacian encodes both D and M and has many
useful properties, such as being positive semi-definite, which
make it very useful for tasks such as clustering [24]. One
property of the graph Laplacian that makes it particularly
suitable for clustering is related to its spectrum. The
spectrum of the graph Laplacian unfolds the data manifold
to give a lower dimensional embedding which
can give “better” clustering results.

Returning to the Bipartite Graph Partitioning Problem, as
demonstrated by Dhillon [10] and Mohar [24], the second
eigenvector of the generalized eigenvalue problem $Lz = \lambda Dz$
gives a real relaxation to the problem of finding the minimum
normalized cut $Q(V_1, V_2)$. The normalized cut is basically
a cut that favours finding balanced partitions, i.e. if
the cut of two different partitions is the same, then the
normalized cut is smaller for the partition which is more
balanced. Thus it favours partitions that are balanced and have
a small cut value. Clearly, the normalized cut is more
suitable for tasks such as clustering [16]. Note that this relates
to the ideas above regarding the optimal bi-partitionings
in the following way: We want balanced clusterings with
minimum cut for solving the bipartite graph partitioning
problem, which would also be the optimal clustering for us.
Thus looking at the Laplacian of the bipartite graph might
provide such a clustering.


2.2 Spectral Co-Clustering

Given the definitions and notions in the previous section,
in this section we state an algorithm [10] for finding the
optimal co-clusters $\{S_1 \cup F_1\}, \ldots, \{S_k \cup F_k\}$ as mentioned
above. For that we define the graph Laplacian of a bipartite
graph, as such an optimal clustering can be found using a
Laplacian. Using the definition $L = D - M$ given above
together with the definitions of D and M, the Laplacian
and the degree matrix may be written as:

$$L = \begin{bmatrix} D_1 & -A \\ -A^T & D_2 \end{bmatrix} \quad \text{and} \quad D = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}$$

where $D_1$ and $D_2$ correspond to the degree matrices of A
and $A^T$ respectively.
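To make the block form of the Laplacian concrete, the sketch below builds L for a small data matrix and evaluates the quadratic form $z^T L z$, which for any graph Laplacian equals the weighted sum of squared differences across edges and is therefore never negative. The helper names `bipartite_laplacian` and `quadratic_form` and the 2 x 2 example are hypothetical; $D_1$ and $D_2$ are taken to be the diagonal matrices of row sums and column sums of A.

```python
def bipartite_laplacian(A):
    """Build L = [[D1, -A], [-A^T, D2]] for an m x n data matrix A (list of rows)."""
    m, n = len(A), len(A[0])
    d1 = [sum(A[i]) for i in range(m)]                       # row sums -> D1
    d2 = [sum(A[i][j] for i in range(m)) for j in range(n)]  # column sums -> D2
    size = m + n
    L = [[0.0] * size for _ in range(size)]
    for i in range(m):
        L[i][i] = d1[i]
    for j in range(n):
        L[m + j][m + j] = d2[j]
    for i in range(m):
        for j in range(n):
            L[i][m + j] = -A[i][j]
            L[m + j][i] = -A[i][j]
    return L


def quadratic_form(L, z):
    """Evaluate z^T L z."""
    size = len(z)
    return sum(z[i] * L[i][j] * z[j] for i in range(size) for j in range(size))


A = [[1.0, 0.5],
     [0.0, 2.0]]
L = bipartite_laplacian(A)

# z = (x, y) with x = (1, -1) on students and y = (1, -1) on features.
# Only the edge with weight 0.5 joins vertices with different values,
# so z^T L z = 0.5 * (1 - (-1))^2 = 2.0.
value = quadratic_form(L, [1.0, -1.0, 1.0, -1.0])
```

Every row of L sums to zero, so the constant vector always gives a quadratic form of zero; this is why the second eigenvector, rather than the first, carries the partitioning information.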

If the generalized eigenvalue problem $Lz = \lambda Dz$ is written
for the above Laplacian of a bipartite graph and then
rearranged, it has been demonstrated [10] that the resulting
equations are those of a singular value decomposition
of the normalized matrix

$$A_n = D_1^{-1/2} A D_2^{-1/2}$$

Thus instead of finding the eigenvector corresponding to the
second smallest eigenvalue, one could find the second left
and right singular vectors of $A_n$ in its place. Since the rows
of A are students and the columns are features, the left
singular vector gives a bi-partitioning of the students while
the right singular vector gives a bi-partitioning of the
features. These can then be used to find the optimal
bi-partition as defined above.
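For reference, the rearrangement can be written out as follows. This is a sketch along the lines of the argument in [10]; the block vectors x and y and the substituted vectors u and v are notation introduced here for illustration, and $D_1$ and $D_2$ are assumed to be nonsingular.

```latex
% Write z = (x, y), with x indexing the students and y indexing the features.
\begin{align*}
\begin{bmatrix} D_1 & -A \\ -A^T & D_2 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
&= \lambda
\begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} \\
\Rightarrow \quad D_1 x - A y &= \lambda D_1 x, \qquad
-A^T x + D_2 y = \lambda D_2 y .
\end{align*}
% Substitute u = D_1^{1/2} x and v = D_2^{1/2} y and rearrange:
\begin{align*}
D_1^{-1/2} A D_2^{-1/2} \, v &= (1 - \lambda) \, u, \\
D_2^{-1/2} A^T D_1^{-1/2} \, u &= (1 - \lambda) \, v .
\end{align*}
```

These are the defining equations of a singular value decomposition of $A_n$ with singular value $1 - \lambda$, so the second smallest eigenvalue of the generalized problem corresponds to the second largest singular value of $A_n$.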

Algorithm 1. 

1. Given the co-occurrence or data matrix A, scaled to
between 0 and 1, form the normalized matrix

$$A_n = D_1^{-1/2} A D_2^{-1/2}$$

2. Compute the second left and right singular vectors of
$A_n$ and concatenate them to form a vector z.

3. Run k-means on this vector to obtain a simultaneous
clustering of both the students and the features.

This algorithm can be extended to the multipartition case if,
instead of finding only the second singular vectors, the first
$\log_2(k)$ singular vectors are found. The rest of the process
remains the same.

Note that this algorithm gives a simultaneous clustering of
the rows and the columns and is restricted in the sense that
the number of row clusters and column clusters has to be the
same. We modify this by running k-means two times. If the
number of row clusters is k and the number of column
clusters is l, then we run k-means on the vector z twice,
once to find k clusters and then to find l clusters. The first
m elements of the length m + n cluster assignment vector from
the first run will then correspond to the row clusters and the
last n elements of the cluster assignment vector from the
second run will correspond to the column cluster indices.
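The steps of Algorithm 1, together with the two k-means runs, can be sketched with NumPy as below. This is an illustrative sketch rather than the implementation used for the experiments: the function names, the quantile-based k-means initialisation and the toy matrix are assumptions, every row and column of A is assumed to have a positive sum, and the singular vectors are concatenated directly as described above (the formulation in [10] additionally rescales them by $D_1^{-1/2}$ and $D_2^{-1/2}$ before clustering).

```python
import numpy as np


def kmeans_1d(values, k, iters=100):
    """Plain Lloyd's algorithm on scalars.

    Assumption of this sketch: centroids start at evenly spaced quantiles.
    """
    values = np.asarray(values, dtype=float)
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    labels = np.zeros(len(values), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = values[labels == c].mean()
    return labels


def spectral_cocluster(A, k, l):
    """Return (row_labels, col_labels) with k row clusters and l column clusters."""
    A = np.asarray(A, dtype=float)
    m, _ = A.shape
    d1 = A.sum(axis=1)                       # diagonal of D1 (row sums)
    d2 = A.sum(axis=0)                       # diagonal of D2 (column sums)
    An = A / np.sqrt(np.outer(d1, d2))       # D1^{-1/2} A D2^{-1/2}
    U, _, Vt = np.linalg.svd(An, full_matrices=False)
    z = np.concatenate([U[:, 1], Vt[1, :]])  # second left and right singular vectors
    row_labels = kmeans_1d(z, k)[:m]         # first run: first m entries are rows
    col_labels = kmeans_1d(z, l)[m:]         # second run: last n entries are columns
    return row_labels, col_labels


# Toy data: students 0-1 respond to features 0-1, students 2-3 to feature 2.
A = [[1.0, 1.0, 0.1],
     [1.0, 1.0, 0.1],
     [0.1, 0.1, 1.0],
     [0.1, 0.1, 1.0]]
rows, cols = spectral_cocluster(A, 2, 2)
```

On this toy matrix the two student groups are separated and each is paired with the features it responds to. The cluster indices themselves are arbitrary, since the sign of a singular vector is not determined.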

3. BAGGING STRATEGY 

The statement of the supervised learning problem in ma- 
chine learning could be roughly stated as follows: Given a 
training set consisting of ordered pairs of feature vectors and
their associated labels (which might be discrete or
continuous), the task of a learning algorithm is to learn a
functional map from the feature space to label space. A
learning algorithm is said to be more powerful if it is able to
learn mappings such that it can generalize well and make 
correct predictions on test data-points on which it was not 
trained. Since the functional map under consideration might 
be highly non-linear, learning algorithms that output only 
a single mapping (frequently referred to as the hypothesis) 
might suffer from statistical, computational and representa- 
tion issues that restrict them from learning good mappings. 
One way of solving this problem is to transform the fea- 
ture space into a more suitable and “richer” representation 
such that learning using this new representation gives much 
better functional maps as compared to the original
representation. This is the motivation behind deep learning methods
which have caused a new wave of excitement in the machine
learning community since 2006 [17]. Another way of solving
this problem, at least partly, is by using ensemble learning
methods [18], [19], [20]. The basic idea behind ensemble methods is
that they involve running a “base learning algorithm” multi- 
ple times, each time with some change in the representation 
of the input (e.g. only considering a subset of features in 
each run) so that a number of diverse predictions (or maps) 
could be obtained. This diversity in prediction is then ex- 
ploited to get better predictions. Thus ensemble methods 
approach the said problem by both trying to learn multi- 
ple functional maps and also by learning a more distributed 
and hence “richer” representation of the input space at the 
same time. In the next section we describe a method to use 
clustering for bootstrapping. 

3.1 Clustering for Bootstrapping 

In earlier work we introduced the idea of using clustering
for bootstrapping. This idea was quite unlike
other bagging methods which use a random subset to
bootstrap. Thus, it had the potential advantage that the 
subsets used to bootstrap could be more interpretable. Be- 
fore we generalize this methodology using co-clustering we 
first briefly describe the methodology using clustering. 

The training set was first clustered into k disjoint clusters. 
A linear regression model was trained on each of the clusters 
only based on the training points assigned to that cluster. 
Since each such linear regression was a representative of only 
one cluster, we called it a cluster model. Thus, for a given k, 
there would be k cluster models. But since all the clusters 
are mutually exclusive, the training set is represented by all 
the cluster models taken together. This is called a prediction 
model ($PM_k$). For an incoming test point on which a
prediction is to be made, we first identify the cluster that point
belongs to. After the cluster has been identified, the appro- 
priate cluster model could be used to make a prediction for 
that point. Now note that we don’t specify the number of 
clusters in the above. Hence, we can change the granular- 
ity of the clustering from 1 to some high value, say K. In 
each instance we would get a different prediction model (a 
special case would be $PM_1$, which would basically be when
one linear regression model is trained on the entire dataset). 
Thus, we would obtain a set of K prediction models each 
of which would make a separate prediction on the test set. 
Since we vary the granularity of the clustering, each of these
predictions is different. This diversity in prediction could be
used by averaging all (or half) of the predictions obtained
to get a single much stronger prediction.

Figure 1: Finding a Prediction Model, $PM_{k,l}$, with k row
clusters and l column clusters
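The clustering-based procedure of this section can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the text: a plain Lloyd's k-means with deterministic initial centroids, ordinary least squares with an intercept as the linear regression, a fall-back to the global fit for an empty cluster, and an average over all K prediction models. All function names are hypothetical.

```python
import numpy as np


def kmeans(X, k, iters=50):
    """Lloyd's algorithm; initial centroids are evenly spaced rows of X (assumption)."""
    idx = np.linspace(0, len(X) - 1, k).astype(int)
    centroids = X[idx].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return centroids, labels


def fit_linear(X, y):
    """Ordinary least squares with an intercept term."""
    design = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def fit_prediction_model(X, y, k):
    """PM_k: one cluster model (linear regression) per k-means cluster."""
    centroids, labels = kmeans(X, k)
    models = {}
    for c in range(k):
        mask = labels == c
        # Empty cluster: fall back to the global fit (assumption of this sketch).
        models[c] = fit_linear(X[mask], y[mask]) if mask.any() else fit_linear(X, y)
    return centroids, models


def predict(prediction_model, X_test):
    """Route each test point to its nearest centroid and apply that cluster model."""
    centroids, models = prediction_model
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    design = np.hstack([X_test, np.ones((len(X_test), 1))])
    return np.array([design[i] @ models[int(c)] for i, c in enumerate(nearest)])


def bagged_prediction(X, y, X_test, K):
    """Average the predictions of PM_1, ..., PM_K."""
    preds = [predict(fit_prediction_model(X, y, k), X_test) for k in range(1, K + 1)]
    return np.mean(preds, axis=0)
```

With `K = 1` this reduces to a single linear regression on the whole training set, i.e. the special case $PM_1$.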

3.2 Co-Clustering for Bootstrapping 

Note that the clustering is only one-way. That is, bootstrap- 
ping is done by only changing the data instances available 
for each cluster model (by changing the number of cluster 
models itself) but the number of features used in each case is 
the same. A cluster basically is a bunch of rows in the data 
matrix with all columns. A co-cluster on the other hand 
would be a “block” in the data matrix with a subset of
rows and a subset of columns assigned to each “co-cluster”.
Thus a co-clustering could be thought of as a simultaneous 
clustering and dimensionality reduction of the data. Note 
that a clustering is only a special case of co-clustering when 
the columns are not clustered at all (or have only one column 
cluster). 

Clearly, the above bagging methodology can be suitably 
modified using co-clustering. For a given number of row 
clusters k and column clusters l we could have k co-clusters 
wherein each cluster has only some features assigned to it
(note that the definition is symmetric, i.e. we could think of
this as l co-clusters). For each co-cluster we train a
separate linear regression model only using the data instances
and features assigned to it. We thus obtain k Co-Cluster 
Models. Like in the above case for clustering, the combina- 
tion of the k co-cluster models would be considered to be 
a Prediction Model which makes a single prediction on the 
test set. We can then vary k from 1 to some value K and l 
from 1 to some value L. By doing so, we would get a total 
of K x L prediction models. We then average a subset of the 
predictions made by these models to obtain a much stronger 
prediction. 

There are some interesting aspects to such a methodology
using co-clustering. For k = 4 and l = 4, the grid in Figure
2 illustrates all the Prediction Models ($PM_{k,l}$) that could be
obtained by co-clustering. The Prediction Model $PM_{1,1}$,
represented by (1, 1), is simply the case when there is one data
cluster and only one feature cluster, i.e. the original data
matrix itself. The prediction model for this case would simply
be a linear regression trained on the entire dataset, considering
all the features. The first column of this grid represents
the case when the number of feature clusters is just
one, while the number of row clusters is changed. Note
that this is simply the methodology described above in Section
3.1 using clustering. The first row of this grid is also
equally interesting. In this case the number of row clusters
is always one, i.e. the entire dataset is considered in all
co-clusters, while the column clusters are successively changed.
It should be noted that this is a sort of step-wise regression,
where a linear regression is trained on the entire dataset but
the number of features that are used to train it is changed
(usually reduced as l increases). All the other cases are a
cross between these two extreme cases. It seems plausible
that a bagging strategy using co-clustering, if averaged
properly, could have more predictive power, as it generates
diversity by considering a different subset of data instances
and features each time, consequently also generating a much
larger set of predictions.

Figure 2: Ordering the Co-Cluster Prediction Models, $PM_{k,l}$

3.3 Blending Predictions 

As mentioned before, the method for combining the predictions
returned by the various prediction models is a naive
averaging strategy. When the prediction models were generated
by clustering ($PM_k$), we either averaged the first K/2
predictions (where K was the maximum number of clusters)
[6] or we learned the best number of prediction models that
could be averaged by an internal cross-validation [6]. The
averaging idea is not immediately straightforward when
co-clustering is used to generate the prediction models. This
is because the prediction models are obtained by changing
two parameters. It is also observed that prediction models
with a high k or l return poor accuracies, thus it wouldn’t be
useful to average predictions from all the $PM_{1,l}$ models first
and then the $PM_{2,l}$ models and so on (i.e. traversing the grid
row-wise or column-wise). Since high values of k and l are
counter-productive, we take the order of the prediction models
such that the sizes of k and l increase uniformly. This
ordering is illustrated by the curve in Figure 2. The first
half of this reordered set of predictions is then averaged.
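One way to realise an ordering in which k and l grow together is sketched below. The exact curve of Figure 2 cannot be recovered from the text, so the sort key used here (largest index first, then the sum of the indices, then k) is an assumption chosen only to illustrate the idea; the function name `ordered_grid` is hypothetical.

```python
def ordered_grid(K, L):
    """Order the (k, l) pairs so that neither index runs far ahead of the other.

    Assumed sort key: (max(k, l), k + l, k). This is one plausible ordering,
    not necessarily the curve shown in the paper's figure.
    """
    pairs = [(k, l) for k in range(1, K + 1) for l in range(1, L + 1)]
    return sorted(pairs, key=lambda p: (max(p), p[0] + p[1], p[0]))


order = ordered_grid(4, 4)
first_half = order[: len(order) // 2]  # the prediction models that get averaged
```

For a 4 x 4 grid this visits (1, 1) first, then the models with at most two row or column clusters, and reaches (4, 4) last, so the first half of the list contains only models with low or moderate k and l.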

4. EXPERIMENTAL VALIDATION 

In this section we report experimental results for using
co-clustering for bagging and compare results with the
benchmark ($PM_{1,1}$) and clustering alone.


4.1 Dataset Description and Context 

We primarily experiment with two datasets in this study. 
This data was collected to study if dynamic assessment, 
which has long been advocated as an effective method for 
assessment, was actually better than the traditional static 
assessment [31], [22]. Dynamic assessment is an interactive 
approach to student assessment which is primarily based on 
how much help a student requires during a practice test. 
Traditional static testing only takes into account the per- 
centage of questions that the student gets correct. Feng et 
al. [23] showed that features that only recorded how much 
assistance a student got while interacting with a tutor alone 
were better predictors of student performance in post-tests 
held later in the year as compared to how many questions 
students got correct. This was confirmed in subsequent
studies. Thus if co-clustering is able to improve
predictions, then this study could further lend weight to the
idea that dynamic testing is indeed better than static testing
and that we could further improve upon $PM_{1,1}$. It must
be noted that $PM_{1,1}$ would correspond to results reported
in [23], which were better than static assessment. $PM_{1,1}$
basically corresponds to the condition when all the dynamic
features are considered and all of the training set is used to
train a predictor.

The datasets come from the 2004-05 and 2005-06 school 
years, the first two full years when ASSISTments.org was 
used in schools in Massachusetts. ASSISTments is an e- 
learning tutoring system developed at Worcester Polytech- 
nic Institute which assesses students as it assists. These 
datasets contain features that measure the interaction of 
students with the tutor and their actual final grades, which 
they obtained at the end of the year in the Massachusetts 
state test (MCAS). There are a total of six features in
these datasets:

1) DA Original Count is the number of questions that the
students answered with assistance in the dynamic condition.
2) DA Original Percent Correct is the percent of questions
of feature 1 that students get correct.
3) DA Scaffold Percent Correct is the percentage of tutorial
help questions that students get correct.
4) DA Average Time is the average time that a student spends
on a question.
5) DA Average Attempt is the average number of attempts
students made per question.
6) DA Average Hints is the average number of hints that
students used.

The task is to use these interaction features to
predict the MCAS scores that students might get at the end
of the school year. The static condition feature is the percentage
of questions answered correctly in static testing. This feature
is never used for making predictions for the dynamic condition.
The data in the 2004-05 set (ASSISTments 2004-05)
is for 628 students, while the 2005-06 data (ASSISTments
2005-06) is for 761 students.

For experimentation we do a five-fold cross-validation on the
dataset and report results for the base condition ($PM_{1,1}$) and
the various blended results which were obtained by averaging
as discussed in Section 3.3. For the sake of comparison
we also include results with k-means clustering. In both
cases we consider the ensembled results, with the top K
predictions averaged as described in earlier work and also in
Section 3.1. Following results in [4] and [5] we report results
in terms of the mean absolute difference (MAD).
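The evaluation protocol can be sketched in plain Python as below. How the folds were actually drawn is not stated, so the interleaved split used here is an assumption; the function names are hypothetical.

```python
def mean_absolute_difference(y_true, y_pred):
    """Average of |true score - predicted score| over all students."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def five_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs.

    Assumption of this sketch: fold i takes every n_folds-th sample starting at i.
    """
    for fold in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == fold]
        train = [i for i in range(n_samples) if i % n_folds != fold]
        yield train, test


mad = mean_absolute_difference([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])  # (2 + 2 + 3) / 3
folds = list(five_fold_indices(10))
```

Each student appears in exactly one test fold, and the reported figure would be the MAD computed over the held-out predictions.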

Figure 3: Performance on the 2004-05 Set

Figure 4: Performance on the 2005-06 Set

Finally, for pre-processing: As mentioned in Section 2, to
obtain a bipartite partitioning A must contain values that
are either binary or scaled between 0 and 1. Thus, in each
fold each feature column is scaled to between 0 and 1 so that
A could be considered a co-occurrence matrix. This marks
a slight difference from earlier papers in which the feature
scaling was done so as to map all the data-points to between
-1 and 1 by using the mapminmax command of MATLAB.
This slight difference might result in a small variation in the
results.
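The column-wise scaling to the range 0 to 1 can be sketched as follows. The function name is hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid a division by zero.

```python
def min_max_scale_columns(rows):
    """Scale every column of a list-of-rows matrix to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0 (assumption)
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


data = [[2.0, 10.0],
        [4.0, 30.0],
        [6.0, 20.0]]
scaled = min_max_scale_columns(data)
```

After scaling, every entry is non-negative, which is what allows the data matrix to be read as a matrix of edge weights in the bipartite graph.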

4.2 Experimental Results 

We first report results on the ASSISTments 2004-05 dataset.
The five-fold cross-validated results using co-clustering are
reported in Figure 3. The number of row clusters (k) and
the number of column clusters (l) were restricted to 4 each.
This resulted in 16 prediction models. The x-axis in the
graph represents the first eight prediction models on doing
co-clustering, while the y-axis simply gives the mean absolute
error. We observe that the accuracy of co-clustering
alone is quite bad (as seen by the blue line) as compared to
the baseline ($PM_{1,1}$, which is basically the result for x = 1 in
this graph; note that the baseline is the dynamic condition
of Feng et al. [23]). These predictions are those given by the first
elements of the ordered set of co-cluster prediction models
as defined in Section 3.3. However, averaging these prediction
models successively gives better and better predictions
(as can be seen by the red line).

Similar results were observed on the ASSISTments 2005-06
dataset, as shown in Figure 4. In this dataset the individual
prediction models are far worse relative to the ensembled
results than in the previous dataset. Again, we obtain 16
prediction models after co-clustering and successively average the
first eight (the first with the second, the first with the second
and third, and so on) after they have been arranged in the way
suggested in Section 3.3. Again the ensembled results do
much better than the baseline (we report exact figures and
significance in Tables 1 and 2).
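The successive averaging (the first model alone, the first with the second, the first three, and so on) amounts to a running mean over the ordered prediction models. A minimal sketch with a hypothetical function name and made-up numbers:

```python
def successive_averages(predictions):
    """predictions[i] holds the test-set predictions of the i-th ordered model.

    Returns a list whose j-th entry is the element-wise average of models 0..j.
    """
    running = [0.0] * len(predictions[0])
    blended = []
    for count, preds in enumerate(predictions, start=1):
        running = [r + p for r, p in zip(running, preds)]
        blended.append([r / count for r in running])
    return blended


# Three ordered prediction models, each predicting scores for two students.
preds = [[10.0, 20.0],
         [14.0, 16.0],
         [12.0, 24.0]]
blended = successive_averages(preds)
```

Scoring each entry of `blended` against the true test scores would give one point per number of averaged models, which is how an ensemble curve of this kind can be traced.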

In Table 1 we compare the mean absolute errors when the
predictions of the first five prediction models are bagged.
We report results when the prediction models are obtained
both by co-clustering and by k-means clustering on the
ASSISTments 2004-05 dataset. The figures in bold indicate
statistical significance over the baseline prediction on a
paired t-test. Table 2 compares the predictions obtained by
using co-clustering and k-means for bagging on the
ASSISTments 2005-06 dataset.


Table 1: Comparison of predictions based on k-means and
Co-Clustering for the ASSISTments 2004-05 Dataset. Figures
in bold indicate significance over the baseline on a paired
t-test. Numbers are Mean Absolute Errors. Note that Pred.
Model 1 corresponds to the baseline.

Pred. Models    Co-Clust    k-means
1               8.7741      8.7741
2               8.7379      8.7518
3               8.7087      8.6725
4               8.6879      8.7153
5               8.6574      8.7100
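
A paired t-test of the kind used for the significance marks in Tables 1 and 2 could be computed along the following lines. This is only a sketch under stated assumptions: the per-student absolute errors below are invented, the function name is hypothetical, and only the t statistic and degrees of freedom are computed (a p-value would additionally need the Student t distribution, for example from a statistics library).

```python
import math
from statistics import fmean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic for two equally long lists of per-student errors.

    Returns (t, degrees_of_freedom). A positive t means errors_a is larger
    on average than errors_b.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    t = fmean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Invented absolute errors for four students (illustration only).
baseline_abs_err = [3.0, 5.0, 4.0, 6.0]
ensemble_abs_err = [2.0, 3.0, 3.0, 3.0]
t_stat, dof = paired_t_statistic(baseline_abs_err, ensemble_abs_err)
```

The test is paired because both sets of errors come from the same students, so the comparison is made on per-student differences rather than on two independent samples.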

The results are significantly better than the baseline and
also indicate that the dynamic assessment condition returns
a much better prediction of student test scores than the
static condition. It has already been noted by [23] and [2]
that the static condition results are significantly worse
than even the baseline, and thus we do not report results
for the static condition.


Table 2: Comparison of predictions based on k-means and
Co-Clustering for the ASSISTments 2005-06 Dataset. Figures
in bold indicate significance over the baseline on a paired
t-test. Numbers are Mean Absolute Errors. Note that Pred.
Model 1 corresponds to the baseline.

Pred. Models    Co-Clust    k-means
1               7.9822      7.9822
2               7.7716      7.8185
3               7.5990      7.8034
4               7.4680      7.7815
5               7.5503      7.6487


5. DISCUSSION AND FUTURE WORK

The datasets that were used for the validation of this
bagging technique, which is based on co-clustering, were not
very large and did not have a large number of columns. Thus,
these results were initially surprising. One would imagine
that in a dataset with a small number of features, feature
selection might not be too helpful. However, our experiments
show otherwise. The results that we obtain, while modest
improvements, show that this technique, though simple, can
give access to a novel source of variance in the data. It can
potentially also have some nice properties in terms of
returning simpler and more interpretable groups. For example,
it was pointed out earlier that one row of the prediction
models was nearly like a linear regression model in which the
features are successively eliminated. At the same time, it
was observed that one column of the prediction models
consisted of just the prediction models that are obtained by
clustering alone, as reported in previous work. It would be
interesting to see how the co-clusters (which are basically
blocks in the data matrix) on a student-item dataset would
pair clusters of student proficiency with clusters of item
performance, which could be seen as a sort of
subject-treatment interaction. In the literature, it has been
said that the real strength of co-clustering is with
binary-valued data, co-occurrence tables, and scenarios that
involve collaborative filtering. Hence, datasets that are
basically a student-by-item matrix would be ideal candidates
for trying out this technique. In the KDD Cup 2010, Toscher
and Jahrer modelled student response data as a collaborative
filtering task and used matrix factorization techniques for
it. Given the connections of co-clustering with matrix
factorization, it is worth investigating how useful it could
be in such a setting.
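
To make the idea of co-clusters as blocks in a student-by-item matrix concrete, the sketch below summarises each block by its mean response. It is an illustration only: the binary response matrix and both sets of cluster labels are made up and supplied by hand, whereas in practice the labels would come from a co-clustering algorithm; the function name is hypothetical.

```python
def block_means(matrix, row_labels, col_labels):
    """Mean value of each (row cluster, column cluster) block.

    matrix is a list of rows; row_labels[i] and col_labels[j] give the
    cluster of row i and column j. Returns a dict keyed by
    (row_cluster, col_cluster).
    """
    sums, counts = {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


# Made-up student-by-item responses (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
student_cluster = [0, 0, 1, 1]  # assumed row (student) cluster labels
item_cluster = [0, 0, 1, 1]     # assumed column (item) cluster labels

means = block_means(responses, student_cluster, item_cluster)
```

Here the first group of students does well on the first group of items and poorly on the second, and the reverse holds for the second group of students, which is the sort of student-cluster by item-cluster pairing discussed above.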

In [3], the authors clustered students based on tutor
interaction features and then trained separate Knowledge
Tracing models for students based on the cluster they were
in. This was done because it was not possible to cluster the
item sequences directly, so an indirect approach had to be
taken. The co-clustering technique seems to offer an
alternative by which such matrices might be clustered more
readily, without the need to cluster the tutor interaction
features.

In summary, in this paper we propose a bagging technique
that uses co-clustering and demonstrate that its performance
is better than that obtained by bagging using clustering. We
also suggest that it is most suitable for datasets that
resemble co-occurrence tables, and believe that this would
be a good direction for future work since student-item
datasets are usually of this form.